Bayesian Logistic Regression: Application to Probability Prediction of Disease (Diabetes)

Capstone Project, 2025

Author

Namita Mishra (Advisor: Dr. Ashraf Cohen)

Published

September 29, 2025

Slides: slides.html


Introduction

Diabetes mellitus (DM) is a major public health concern. Identifying risk factors associated with DM (obesity, age, race, and gender) and targeting interventions for prevention are crucial. Logistic regression estimates the association between risk factors and a binary outcome (presence or absence of diabetes). However, classical maximum likelihood estimation (MLE) yields unstable results in small samples, with missing data, or under quasi- or complete separation. Standard analytical approaches are also insufficient for the complexity of healthcare data (DNA sequences, imaging, patient-reported outcomes, electronic health records (EHRs), longitudinal health measurements, diagnoses, and treatments) (Zeger et al., 2020). A Bayesian hierarchical model with Markov chain Monte Carlo (MCMC) has been applied to multivariate longitudinal healthcare data, integrating prior knowledge to predict patient health status (Zeger et al., 2020). The model has two levels of data structure, (1) repeated measures over time within individuals and (2) individuals nested within a population, with added exogenous covariates (e.g., age, clinical history) and endogenous covariates (e.g., current treatment), and yields posterior distributions of the parameters. MCMC estimation provides marginal distributions of the regression coefficients for risk prediction (pneumonia, prostate cancer, and mental disorders). The model's limitation is its parametric nature, suggesting nonparametric or more flexible parametric alternatives.

As an application of Bayesian inference, Chatzimichail and Hatjimihail (2023) compared parametric distributions (with a fixed set of parameters) and nonparametric distributions (which make no a priori assumptions) on National Health and Nutrition Examination Survey data from two separate diagnostic tests applied to both diseased and non-diseased populations, producing posterior probabilities for disease classification. Conventional methods based on clinical criteria and fixed numerical thresholds fail to capture the intricate relationship between diagnostic tests and disease prevalence, and the dichotomous method (based on the overlap of probability distributions between the diseased and non-diseased groups) fails to capture the complexity and heterogeneity of diverse populations; its applicability to skewed, bimodal, or multimodal data is critiqued. Integrating priors with data from multiple diagnostic tests improves diagnostic accuracy and precision. The Bayesian nonparametric approach is more flexible, adaptable, versatile, and robust than the parametric one, capturing complex data patterns and producing multimodal probability patterns rather than the bimodal, double-sigmoidal curves of parametric models. Limited scholarly publications and over-dependence on priors limit the model's applicability. Combined with other statistical and computational techniques, it enhances diagnostic capability in situations with systemic bias and unrepresentative, incomplete, or non-normal datasets (Chatzimichail and Hatjimihail, 2023).
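As a concrete illustration of the posterior probability of disease given a test result, Bayes' theorem can be evaluated directly in base R. The sensitivity, specificity, and prevalence values below are hypothetical, not figures from the cited study:

```r
# Posterior probability of disease given a positive test (Bayes' theorem):
#   P(D | +) = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
post_prob_disease <- function(sens, spec, prev) {
  (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
}

# Hypothetical test: 90% sensitivity, 95% specificity, 10% prevalence
post_prob_disease(sens = 0.90, spec = 0.95, prev = 0.10)  # 2/3, i.e. ~0.667
```

Even a good test yields only a moderate posterior probability when the prevalence is low, which is why fixed numerical thresholds can mislead.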

The Bayesian methodology explained by Schoot et al. (2021) emphasizes the importance of priors, data modeling, inference, model checking, sampling techniques for the posterior distribution, variational inference, and variable selection for applicability across varied research fields (social sciences, ecology, genetics, and medicine). Variable selection is emphasized because multicollinearity, insufficient sampling, and overfitting result in poor predictive performance and difficult interpretation. Prior types (informative, weakly informative, and diffuse) are distinguished by the degree of (un)certainty (hyperparameters) surrounding the population parameter: a larger variance represents greater uncertainty, and a mildly informative prior suits small sample sizes. Prior elicitation methods (subject-matter experts, generic experts, data-based priors, and sample data using maximum likelihood or sample statistics) are used together with prior sensitivity analysis to assess how the priors and the likelihood align. Priors provide data-informed shrinkage, regularization, or influence on algorithms, concentrating the high-density region and improving estimation. Specifying priors in small or less informative samples strengthens the observed data by lending plausible values to the unknown parameters. With unknown parameters taking varied values, observed data fixed, and the likelihood function generating a range of possible values, MCMC algorithms sample from a given distribution through computer simulation and provide empirical estimates of the posterior distribution (brms and blavaan in R). The frequentist method, by contrast, does not assign probability to the unknown parameters but treats them as fixed, with the likelihood based on the conditional probability distribution p(y|θ) of the data y given fixed parameters θ.
Spatial and temporal variability factored into Bayesian models has varied applicability: large-scale cancer genomic data, identifying novel molecular-level changes and interactions between mutated genes, capturing mutational signatures, allowing genomics-based patient stratification (clinical trials, personalized use of therapeutics), and understanding cancer evolutionary processes. The Bayesian model is reproducible, but autocorrelation in the temporal model and subjectivity in prior elicitation are limitations.

Prior elicitation, analytical posteriors, and robustness checks in Bayesian normal linear regression, with a parametric (conjugate) model incorporating a Normal-Inverse-Gamma prior, have been demonstrated in metrology by Klauenberg et al. (2015) to calibrate instruments. In the Gaussian model, errors are independent and identically distributed, the variance is unknown, and the relationship between X and Y is statistical, with noise and model uncertainty, so the regression cannot be treated as a measurement function. Methods such as likelihood, Bayesian, and bootstrap approaches quantify uncertainty, account for a priori information, and robustify analyses through prior elicitation and posterior calculation. Observables (data) and unobservables (parameters and auxiliary variables) are treated as unknown and random, and the assigned probability distributions update prior knowledge about the unobservables, enhancing the elicitation and interpretation process. With a Normal-Inverse-Gamma (NIG) prior, the posterior belongs to the same NIG family (a conjugate prior), and suitable hyperparameters can yield vague or non-informative prior distributions. An alternative (hierarchical) prior adds a further layer of distributions to express uncertainty.
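The conjugate NIG update can be sketched in base R. The simulated data, prior hyperparameters, and function name below are our illustration of the standard NIG posterior formulas, not Klauenberg et al.'s specific calibration setup:

```r
set.seed(7)
n <- 50
x <- rnorm(n)
X <- cbind(1, x)                               # design matrix with intercept
y <- as.vector(X %*% c(2, -1) + rnorm(n, sd = 0.5))

# Conjugate Normal-Inverse-Gamma prior:
#   beta | sigma^2 ~ N(m0, sigma^2 * V0),  sigma^2 ~ InvGamma(a0, b0)
# The posterior is NIG again, with updated hyperparameters:
nig_update <- function(m0, V0, a0, b0, X, y) {
  V0_inv <- solve(V0)
  Vn_inv <- V0_inv + crossprod(X)
  Vn <- solve(Vn_inv)
  mn <- Vn %*% (V0_inv %*% m0 + crossprod(X, y))
  an <- a0 + length(y) / 2
  bn <- b0 + 0.5 * drop(sum(y^2) + t(m0) %*% V0_inv %*% m0 - t(mn) %*% Vn_inv %*% mn)
  list(mn = drop(mn), Vn = Vn, an = an, bn = bn)
}

# With a very diffuse prior the posterior mean approaches the OLS estimate
post <- nig_update(m0 = c(0, 0), V0 = diag(1e6, 2), a0 = 0.01, b0 = 0.01, X, y)
post$mn
coef(lm(y ~ x))   # nearly identical
```

This conjugacy is why the NIG model admits an analytical posterior with no MCMC required.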

A Bayesian hierarchical/meta-analytic linear regression model (DeLeeuw, 2012) augments the data by incorporating both exchangeable and unexchangeable information on regression coefficients (and standard errors) from previous studies, addressing the low statistical power associated with multiple testing, the problem of conducting separate significance tests for every regression coefficient in modest samples across studies with different predictors, and the need for larger samples. Classical linear regression does not incorporate priors and produces estimates that are unreliable and vulnerable to sampling variation. Where previous studies are unavailable, priors from a meta-analysis used in Bayesian regression address the challenge of large required sample sizes, supplementing and resolving the limitations of univariate analyses that overlook the relationships among multiple regression parameters within a study. With prior specifications drawn from both prior and current data, priors are categorized as (1) exchangeable, when the current data and previous studies share the same set of predictors, and (2) unexchangeable, when the predictors differ across studies. The probability density function for the data (sampled with the Gibbs sampler) and the likelihood function reflect prior assumptions about the model parameters before observing the data. The hierarchical unexchangeable model is applicable for studying differences between studies, enabling explicit testing of the exchangeability assumption, but its applicability is limited by the correlation issue of requiring an identical set of predictors (DeLeeuw, 2012).

Bayesian logistic regression (Bayesian GLM). A sequential clinical-reasoning approach was applied in screening adults (20–79 years, Taiwan) to classify incident cancers and cardiovascular disease. Using three models with sequential addition of predictors, (1) demographic features (basic model), (2) six metabolic syndrome components (metabolic score model), and (3) conventional risk factors (enhanced model), and by incorporating priors, Liu (2013) demonstrated that the model could address the limited availability of molecular information and offers an alternative that leverages routinely collected biological markers for disease classification. In contrast, the Framingham Risk Score is limited by geographic, ethnic, and social contextual heterogeneity across populations. Emulating a clinician's evaluation process, the model assumes normally distributed regression coefficients, accounts for uncertainty in clinical weights, and averages credible intervals for predicted risk estimates. By updating prior clinical expectations with objective observed data (patient history and laboratory findings), the resulting posterior distributions (enhanced model) showed that patient background contributed significantly to baseline risk estimation by integrating individual characteristics that capture ecological heterogeneity. The model's limitations are potential interactions between predictors and the lack of external cross-validation. Bayesian multiple imputation with logistic regression (Austin, 2021) addresses missing data in clinical research.
Analyzing the reasons for missing values, (i) patients refusing to answer specific questions, (ii) loss to follow-up, (iii) investigator or mechanical errors, or (iv) physicians choosing not to order certain investigations, requires understanding the type of missingness: missing at random (MAR), missing not at random (MNAR), or missing completely at random (MCAR). Addressing missingness by multiple imputation (MI, implemented in R, SAS, or Stata) allowed classifying patients with heart failure and estimating 1-year mortality. MI generates plausible values, creating multiple completed datasets, conducting identical statistical analyses across them, and providing robust estimates through pooled results.
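The pooling step of MI can be sketched with Rubin's rules in base R; the per-imputation estimates and standard errors below are made-up numbers, not results from the cited study:

```r
# Rubin's rules: pool point estimates and SEs from m imputed datasets
est <- c(0.52, 0.48, 0.55, 0.50, 0.47)   # illustrative per-imputation estimates
se  <- c(0.10, 0.11, 0.10, 0.12, 0.11)   # their standard errors
m <- length(est)

qbar  <- mean(est)                 # pooled point estimate
W     <- mean(se^2)                # within-imputation variance
B     <- var(est)                  # between-imputation variance
T_var <- W + (1 + 1/m) * B         # total variance
pooled_se <- sqrt(T_var)

c(estimate = qbar, se = pooled_se)
```

The total variance exceeds the average within-imputation variance, which is how MI propagates the uncertainty due to the missing data into the final estimate.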

Aims

The present study focuses on the application of Bayesian logistic regression to predict diabetes status from body mass index (BMI), age, gender, and race using a retrospective dataset (2013–2014 NHANES survey). The dataset presents challenges such as quasi-separation, missing values, and a relatively small effective sample size, and traditional logistic regression has limitations in dealing with these anomalies. Initial data exploration yielded 9,813 observations across five selected variables. Complete case analysis (listwise deletion) reduced the sample to only 14 complete cases and produced quasi-separation, with implausibly large coefficients and unstable estimates. These analytic limitations of traditional logistic regression motivate us to perform Multiple Imputation by Chained Equations (MICE) in conjunction with Bayesian logistic regression. This approach provides a flexible framework for modeling uncertainty, incorporating prior knowledge, and addressing quasi-separation and the limited sample size.

Method and Data Preparation

The statistical tool used is R, with R packages and libraries to import, manage, and analyze the data. The data come from the NHANES 2-year cross-sectional cycle (2013–2014), using three datasets (demographics, exam, questionnaire) (Center for Health Statistics, 1999). Using the haven package, the .XPT files were imported into RStudio and converted to data frames (df).

Code
# Loading packages (install nhanesA and gt on first use)
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("nhanesA")
library(nhanesA)

library(tidyverse)   # loads dplyr, tidyr, forcats, ggplot2, and more
library(knitr)
library(ggthemes)
library(ggrepel)
library(dslabs)
library(Hmisc)

library(classpackage)
library(janitor)
install.packages("gt")
library(gt)
library(survey)
library(DataExplorer)
library(logistf)
Data Management

Subsets created from the original three weighted datasets (demographics, exam, questionnaire) were merged into a single dataframe for analysis and exploration. The merged dataframe included the selected variables of interest and was cleaned; variables were categorized, recoded, and analyzed. Basic statistics, anomalies, and patterns were reported before running the Bayesian regression. The final dataset included weighted means of all selected variables of interest. The table below describes all variables in the data.

  1. Response Variable (Diabetes, binary) was defined from the questionnaire item "Is there one doctor or other health professional you usually see for your diabetes?"

  2. Predictor Variables (Body Mass Index, factor, 4 levels)

    The original data has BMDBMIC (measured BMI category) as categorical and had no missing values in the source file. It (BMI) has the following 4 levels:
    o Underweight (<5th percentile)
    o Normal (5th–<85th percentile)
    o Overweight (85th–<95th percentile)
    o Obese (≥95th percentile)
    o Missing
    We kept it as is, since the categorization provides clinically interpretable groups.

  3. Covariates:

  • Gender (factor, 2 levels): Male, Female
  • Ethnicity (factor, 5 levels): Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Other Race - Including Multi-Racial
  • Age (numeric, continuous)
Code
library(gt)
# formation of table with variable details

variables <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
descriptions <- c("Respondent sequence number", 
        "BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.", 
        "Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of age.", 
        "Gender", 
        "Race/ethnicity recode of reported race and Hispanic origin information", 
        "Sample PSU", 
        "Sample strata", 
        "MEC exam weight", 
        "Diabetes status: Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors.")
df <- data.frame(Variable = variables, Description = descriptions)
df %>%
  gt() %>%
  tab_header(
    title = "Table Variable Description"
  ) %>%
  tab_footnote(
    footnote = "Each variable in the dataset, accompanied by a qualitative description from the study team."
  )
Table Variable Description
Variable Description
ID Respondent sequence number
BMI BMI calculated as weight in kilograms divided by height in meters squared, and then rounded to one decimal place.
Age Age in years of the participant at the time of screening. Individuals 80 and over are topcoded at 80 years of Age.
Gender Gender
Race Race/ethnicity Recode of reported race and Hispanic origin information
PSU Sample PSU
Strata Sample strata
Weight MEC exam weight
Diabetes Diabetes status Is there one doctor or other health professional {you usually see/SP usually sees} for {your/his/her} diabetes? Do not include specialists to whom {you have/SP has} been referred such as diabetes educators, dieticians or foot and eye doctors.
Each variable in the dataset, accompanied by a qualitative description from the study team.

The data structure, a plot of the data, and the breakdown of the missingness are presented below.

Code
# weighted means of each variable                       
str(merged_data)
'data.frame':   9813 obs. of  9 variables:
 $ ID      : num  73557 73558 73559 73560 73561 ...
 $ BMI     : Factor w/ 4 levels "Underweight",..: NA NA NA 2 NA NA NA NA NA NA ...
 $ Age     : num  69 54 72 9 73 56 0 61 56 65 ...
 $ Gender  : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
 $ Race    : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
 $ PSU     : num  1 1 1 2 2 1 1 1 1 2 ...
 $ Strata  : num  112 108 109 109 116 111 105 114 112 112 ...
 $ Weight  : num  13481 24472 57193 55767 65542 ...
 $ Diabetes: Factor w/ 2 levels "Yes","No": 1 1 2 NA NA NA NA NA NA NA ...
Code
plot_str(merged_data)
introduce(merged_data)
  rows columns discrete_columns continuous_columns all_missing_columns
1 9813       9                4                  5                   0
  total_missing_values complete_rows total_observations memory_usage
1                15381            14              88317       554872
Code
plot_intro(merged_data, title="Figure 1 (Merged dataset). Structure of variables and missing observations.")

Code
plot_missing(merged_data, title="Figure 2(Merged dataset). Breakdown of missing observations.")

  • Raw Data Exploration and Visualization

  • Most cases are non-Hispanic White.

  • Of those who reported their diabetes status, more reported having diabetes.

  • Genders are relatively evenly distributed.

  • The majority of the population was in the normal BMI range.

  • Histograms (Figure 5) show the frequency distribution of Age, with slight right skewness.

Code
# Correct survey design
nhanes_design <- svydesign(
  id = ~PSU,        # primary sampling unit
  strata = ~Strata, # stratification variable
  weights = ~Weight,# survey weights
  data = merged_data,
  nest = TRUE
)

# Weighted proportion of Diabetes
svymean(~Diabetes, design = nhanes_design, na.rm = TRUE)
               mean    SE
DiabetesYes 0.78759 0.021
DiabetesNo  0.21241 0.021
Code
svymean(~Age , design = nhanes_design, na.rm = TRUE)
      mean     SE
Age 37.504 0.4412
Code
svymean(~BMI, design = nhanes_design, na.rm = TRUE)
                     mean     SE
BMIUnderweight   0.037703 0.0037
BMINormal weight 0.628530 0.0120
BMIOverweight    0.162217 0.0056
BMIObese         0.171551 0.0109
Code
svymean(~Gender, design = nhanes_design, na.rm = TRUE)
                mean     SE
GenderMale   0.48861 0.0065
GenderFemale 0.51139 0.0065
Code
svymean(~Race, design = nhanes_design, na.rm = TRUE)
                                            mean     SE
RaceMexican American                    0.110450 0.0199
RaceOther Hispanic                      0.060637 0.0106
RaceNon-Hispanic White                  0.622315 0.0365
RaceNon-Hispanic Black                  0.120895 0.0173
RaceOther Race - Including Multi-Racial 0.085704 0.0076

Using library(survey), we extract the weighted means and standard errors of the variables from the data (9,813 observations).

Data preprocessing and cleaning steps:

  • Special codes in the survey are not random and cannot simply be dropped: ignoring informative missingness (MAR or MNAR) could introduce bias. They were transformed into NAs (based on the variable codebook). All NAs were retained in the dataset, since R automatically excludes rows with NA during regression.
  • We fitted the regression with glm(), and listwise deletion (complete case analysis) resulted in a drastically reduced sample size (n = 14).
  • We observed a quasi-separation warning on our dataset.
  • The results with all NAs removed and the breakdown of the missingness are presented below.
  • Descriptive statistics (counts, frequencies, proportions, means, and SDs) and visualizations of each variable and the proportions of variables are presented here.
  • Frequency plots for both continuous and categorical variables are shown below.
  • A tabular view of counts and proportions of all variables is created.
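As a minimal sketch of the special-code recoding step (the specific codes 7 = Refused and 9 = Don't know are an assumption for illustration; the actual codes come from each variable's NHANES codebook):

```r
# Hypothetical questionnaire responses: 1 = Yes, 2 = No, 7 = Refused, 9 = Don't know
diq <- c(1, 2, 1, 7, 9, 2, 1)

# Recode special codes to NA rather than dropping them silently
diq[diq %in% c(7, 9)] <- NA

# Convert to a labeled factor for analysis
diabetes <- factor(diq, levels = c(1, 2), labels = c("Yes", "No"))
table(diabetes, useNA = "ifany")
```

Keeping the NAs explicit preserves the information that these responses were missing for a reason, which matters for the MAR/MNAR discussion above.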
Code
library(dplyr)
library(knitr)


# 1. continuous variable summary
cont_summary <- merged_data %>%
  summarise(
    Mean = round(mean(Age, na.rm = TRUE), 2),
    SD   = round(sd(Age, na.rm = TRUE), 2),
    Min  = min(Age, na.rm = TRUE),
    Max  = max(Age, na.rm = TRUE)
  ) %>%
  mutate(
    Variable = "Age",
    Category = "Continuous",
    Count = nrow(merged_data) - sum(is.na(merged_data$Age)),
    Proportion = NA
  ) %>%
  select(Variable, Category, Count, Proportion, Mean, SD, Min, Max)

# 2. categorical variables
cat_summary <- function(df, var, name) {
  df %>%
    count({{var}}) %>%
    mutate(
      Proportion = round(n / sum(n), 3),
      Variable = name,
      Mean = NA, SD = NA, Min = NA, Max = NA
    ) %>%
    rename(Category = {{var}}, Count = n) %>%
    select(Variable, Category, Count, Proportion, Mean, SD, Min, Max)
}

# 3. summaries for categorical variables
gender_summary <- cat_summary(merged_data, Gender, "Gender")
BMI_summary    <- cat_summary(merged_data, BMI, "BMI Category")
race_summary   <- cat_summary(merged_data, Race, "Race")
diab_summary   <- cat_summary(merged_data, Diabetes, "Diabetes")

# 4. combine all into one table
final_table <- bind_rows(
  cont_summary,
  gender_summary,
  BMI_summary,
  race_summary,
  diab_summary
)

# 5. display as a table
kable(final_table, caption = "Table 1. Descriptive Statistics of Study Variables")
Table 1. Descriptive Statistics of Study Variables
Variable Category Count Proportion Mean SD Min Max
Age Continuous 9813 NA 31.63 24.4 0 80
Gender Male 4831 0.492 NA NA NA NA
Gender Female 4982 0.508 NA NA NA NA
BMI Category Underweight 132 0.013 NA NA NA NA
BMI Category Normal weight 2167 0.221 NA NA NA NA
BMI Category Overweight 595 0.061 NA NA NA NA
BMI Category Obese 629 0.064 NA NA NA NA
BMI Category NA 6290 0.641 NA NA NA NA
Race Mexican American 1685 0.172 NA NA NA NA
Race Other Hispanic 930 0.095 NA NA NA NA
Race Non-Hispanic White 3538 0.361 NA NA NA NA
Race Non-Hispanic Black 2198 0.224 NA NA NA NA
Race Other Race - Including Multi-Racial 1462 0.149 NA NA NA NA
Diabetes Yes 553 0.056 NA NA NA NA
Diabetes No 169 0.017 NA NA NA NA
Diabetes NA 9091 0.926 NA NA NA NA
Code
library(gt)

# Cross-tabulation: Diabetes vs BMI
tab1 <- table(merged_data$Diabetes, merged_data$BMI, useNA = "ifany")
prop.table(tab1) * 100  # overall percentages
      
       Underweight Normal weight  Overweight       Obese        <NA>
  Yes   0.00000000    0.05095282  0.03057169  0.02038113  5.53347600
  No    0.00000000    0.02038113  0.02038113  0.00000000  1.68144298
  <NA>  1.34515439   22.01161724  6.01243249  6.38948334 56.88372567
Code
# Cross-tabulation: Race vs Diabetes
tab2 <- table(merged_data$Race, merged_data$Diabetes, useNA = "ifany")
prop.table(tab2) * 100  # overall percentages
                                     
                                             Yes         No       <NA>
  Mexican American                     0.8763885  0.4178131 15.8768980
  Other Hispanic                       0.4891470  0.1528585  8.8352186
  Non-Hispanic White                   2.0992561  0.5706716 33.3842862
  Non-Hispanic Black                   1.4878223  0.3668603 20.5441761
  Other Race - Including Multi-Racial  0.6827678  0.2140018 14.0018343
Code
# Cross-tabulation: Gender vs Diabetes
tab3 <- table(merged_data$Gender, merged_data$Diabetes, useNA = "ifany")
prop.table(tab3) * 100  # overall percentages
        
                Yes         No       <NA>
  Male    2.6903088  0.8865790 45.6537247
  Female  2.9450729  0.8356262 46.9886885
Code
## Bar plot of Age, gender, race, diabetes status, BMI ## 

plot_bar(merged_data, title = "Figure 3(Merged dataset). Frequency plots of categorical variables.")

Method

We performed a Bayesian logistic regression analysis chosen to match the characteristics of our data: the outcome, Diabetes (yes/no), is binary, so the response is a discrete variable taking two values, 0 or 1 (a Bernoulli probability model), and the posterior probability follows from Bayes' rule. The regression links a linear function of the predictors X to the response Y; such classification tasks analyze a categorical response variable, predicting or classifying the response category.
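As a self-contained sketch of what Bayesian logistic regression computes, the following random-walk Metropolis sampler fits the Bernoulli/logit model on simulated data. This toy sampler, its simulated data, and its Normal(0, 5²) priors are our illustration only, not the model fit reported for the NHANES data:

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))   # true intercept -1, slope 1.5

# Log posterior: Bernoulli likelihood with logit link + Normal(0, 5^2) priors
log_post <- function(beta) {
  eta <- beta[1] + beta[2] * x
  sum(dbinom(y, 1, plogis(eta), log = TRUE)) +
    sum(dnorm(beta, 0, 5, log = TRUE))
}

# Random-walk Metropolis sampler
draws <- matrix(NA_real_, nrow = 5000, ncol = 2)
beta <- c(0, 0)
lp <- log_post(beta)
for (i in 1:5000) {
  prop <- beta + rnorm(2, 0, 0.3)          # symmetric proposal
  lp_prop <- log_post(prop)
  if (log(runif(1)) < lp_prop - lp) {      # Metropolis acceptance step
    beta <- prop
    lp <- lp_prop
  }
  draws[i, ] <- beta
}

post_mean <- colMeans(draws[2501:5000, ])  # discard burn-in
post_mean   # posterior means near the true (-1, 1.5)
```

Unlike MLE, the output is a full posterior sample, so credible intervals and posterior probabilities come directly from the draws.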

We later compare the two approaches:

  1. Frequentist methods: multiple logistic regression model, baseline regression model, and Firth (penalized) regression model
  2. Bayesian logistic regression model

Below are the principles, concepts, and assumptions of the two approaches.

  1. Multiple logistic regression
  • Multiple logistic regression on the raw dataset resulted in a small sample size (complete case analysis with listwise deletion of NAs); the first 6 deleted rows are presented.
  • The data resulted in quasi-separation (Van Buuren, 2012).
  • Exploring the causes of missingness revealed missing at random and missing not at random (MAR and MNAR) patterns, which, as reported previously, are common in healthcare and public health datasets (see the missingness plot below).
  2. Baseline regression model (only BMI to predict diabetes)
  • We fitted a baseline model to check whether the predictors significantly improve predictive power.
  • Null deviance = 16.75 (baseline fit).
  • Residual deviance = 15.11 (with BMI).
  • The drop is small: the BMI category adds very little predictive value over simply assuming the overall diabetes prevalence.
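The size of that deviance drop can be checked with a likelihood-ratio chi-square test. The 3 degrees of freedom are our assumption (4 BMI levels imply 3 dummy coefficients):

```r
null_dev  <- 16.75   # baseline (intercept-only) fit, from the output above
resid_dev <- 15.11   # after adding BMI category
df <- 3              # assumed: 4 BMI levels -> 3 dummy coefficients

lr <- null_dev - resid_dev                       # likelihood-ratio statistic
p_value <- pchisq(lr, df = df, lower.tail = FALSE)
p_value   # well above 0.05: BMI category is not a significant improvement
```

This formalizes the observation that BMI alone adds little over assuming the overall prevalence.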

The anomalies in the data were handled as follows: quasi-separation by performing Firth regression, and the small dataset resulting from listwise deletion by performing Multivariate Imputation by Chained Equations (MICE).

The concept and results of Firth regression and MICE are presented below.

  1. Firth (penalized) regression

Firth (penalized) regression was considered to handle quasi-separation (D'Angelo and Ran, 2025). Firth regression is a frequentist approach that uses the Jeffreys prior for bias correction. Unlike Bayesian logistic regression, it provides no posterior distribution and involves no MCMC sampling.
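A minimal base-R sketch of the Firth idea, maximizing the Jeffreys-penalized log-likelihood l(β) + ½ log|I(β)| with optim() on a toy separated dataset (this hand-rolled version is for illustration only; in practice the logistf package is used):

```r
# Firth-penalized log-likelihood: l(beta) + 0.5 * log|I(beta)|
firth_loglik <- function(beta, X, y) {
  p <- plogis(X %*% beta)
  p <- pmin(pmax(p, 1e-10), 1 - 1e-10)          # guard against log(0)
  W <- diag(as.vector(p * (1 - p)))
  info <- t(X) %*% W %*% X                      # Fisher information
  sum(y * log(p) + (1 - y) * log(1 - p)) +
    0.5 * as.numeric(determinant(info)$modulus) # log-determinant penalty
}

# Toy data with complete separation: plain MLE would diverge
x <- c(0, 0, 1, 1)
y <- c(0, 0, 1, 1)
X <- cbind(1, x)

fit <- optim(c(0, 0), firth_loglik, X = X, y = y,
             control = list(fnscale = -1))      # fnscale = -1 => maximize
fit$par   # finite, moderate estimates despite the separation
```

The penalty term shrinks the estimates away from infinity, which is exactly why Firth regression returns finite coefficients on separated data.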

Below are the summaries from the MLR (raw data), the baseline model regression, and the Firth regression.

Data exploration of the raw dataset

Unexpected reports, patterns or anomalies in the raw data

  • Issue of quasi-complete separation in the data (9,799 observations dropped).
  • Reduced sample size with few complete cases (n = 14).
  • The model is overfitted to this subset and cannot be generalized.
  • Huge coefficients (e.g., 94, –50, 73) and the tiny residual deviance suggest perfect separation and sparse data: some categories have very few observations, resulting in imbalance in the outcome (very few cases of 0 or 1).
  • Logistic regression cannot estimate stable coefficients when predictors perfectly classify the outcome.
  • Firth regression dealt with quasi-separation by keeping the coefficients finite, but the sample remained reduced (n = 14), so the estimates are highly uncertain, with wide confidence intervals; no strong claims can be made about predictor effects.
  • The missingness is multivariate and non-monotone (arbitrary), with a connected pattern, since all variables were observed in at least some cases.
  • To estimate a correlation coefficient between two variables, they need to be connected, either directly by a set of cases that have scores on both, or indirectly through their relation with a third set of connected data.

Interpretation of the 14 non-missing cases could not be trusted because of the small sample size and the separation problem. Models with all predictors together and with sequential addition of predictors all showed unstable and extreme estimates, with standard errors that were not meaningful. Adding more predictors makes the deviance drop, but this indicates overfitting/separation, not true explanatory power. BMI alone contributes very little; race and gender make the models appear stronger, but this is based on a small sample (n = 14) with complete separation, and is not generalizable evidence.
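The separation behavior described above can be reproduced on a tiny artificial dataset:

```r
# Complete separation: x = 1 occurs only with y = 1, and x = 0 only with y = 0
y <- c(0, 0, 0, 1, 1, 1)
x <- c(0, 0, 0, 1, 1, 1)

# glm() warns that fitted probabilities are numerically 0 or 1;
# the slope runs off toward infinity and its standard error is meaningless
fit <- suppressWarnings(glm(y ~ x, family = binomial))
coef(fit)["x"]   # a huge, unstable estimate
```

The same mechanism, with categorical predictors perfectly classifying the few observed outcomes, produced the implausible coefficients in the n = 14 complete-case fit.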

We decided to perform imputation to retain the full N ≈ 9,813, dealing with the small sample size and avoiding quasi-separation.

  1. Multivariate Imputation by Chained Equations (MICE), van Buuren and Groothuis-Oudshoorn (2011) - a Bayesian approach
  • Multiple imputation (MI) is considered an alternative approach for the given raw dataset. Flatness of the density, heavy tails, non-zero peakedness, skewness, and multimodality do not hamper the good performance of multiple imputation for the mean structure in samples of n > 400, even for high percentages (75%) of missing data in one variable (Van Buuren, 2012).
  • Multiple imputation is a Bayesian approach (we use the popular mice package in R) and adds sampling variability to the imputations.
  • Iterative imputation (MICE) imputes missing values of one variable at a time, using regression models based on the other variables in the dataset.
  • This is a chained process: each imputed variable becomes a predictor for the subsequent imputation, and the entire process is repeated multiple times to create several complete datasets, each reflecting different possibilities for the missing data.
  • Each variable is imputed using its own appropriate univariate regression model.
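The chained mechanism can be sketched in base R with two continuous variables and stochastic regression imputation. This is a simplification of what mice does: its default pmm additionally matches each imputed value to an observed donor and draws the regression parameters from their posterior:

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[sample(n, 40)] <- NA          # introduce missingness in both variables
y[sample(n, 40)] <- NA
dat <- data.frame(x, y)

# Start from mean imputation, then iterate the chained regressions
dat_i <- dat
dat_i$x[is.na(dat$x)] <- mean(dat$x, na.rm = TRUE)
dat_i$y[is.na(dat$y)] <- mean(dat$y, na.rm = TRUE)

for (iter in 1:10) {
  # Impute x from y (fit on originally observed x only), adding residual noise
  fx <- lm(x ~ y, data = dat_i, subset = !is.na(dat$x))
  px <- predict(fx, newdata = dat_i)
  dat_i$x[is.na(dat$x)] <- px[is.na(dat$x)] + rnorm(sum(is.na(dat$x)), 0, sigma(fx))

  # Impute y from x: the freshly imputed x now serves as a predictor
  fy <- lm(y ~ x, data = dat_i, subset = !is.na(dat$y))
  py <- predict(fy, newdata = dat_i)
  dat_i$y[is.na(dat$y)] <- py[is.na(dat$y)] + rnorm(sum(is.na(dat$y)), 0, sigma(fy))
}

anyNA(dat_i)                     # FALSE: all values imputed
coef(lm(y ~ x, data = dat_i))    # slope near the true value 2
```

Repeating this with different random noise yields the m completed datasets that are later analyzed identically and pooled.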
Code
## Multiple Imputation performed 
# Subset variables for imputation in analytic_data df
library(dplyr)
library(ggplot2)
library(mice)
library(VIM)
library(janitor)

# 1. Select variables for imputation
vars <- c("ID", "BMI", "Age", "Gender", "Race", "PSU", "Strata", "Weight", "Diabetes")
analytic_data <- merged_data[, vars]

glimpse(analytic_data)
Rows: 9,813
Columns: 9
$ ID       <dbl> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ BMI      <fct> NA, NA, NA, Normal weight, NA, NA, NA, NA, NA, NA, NA, Normal…
$ Age      <dbl> 69, 54, 72, 9, 73, 56, 0, 61, 56, 65, 26, 9, 76, 10, 10, 33, …
$ Gender   <fct> Male, Male, Male, Male, Female, Male, Male, Female, Female, M…
$ Race     <fct> Non-Hispanic Black, Non-Hispanic White, Non-Hispanic White, N…
$ PSU      <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1…
$ Strata   <dbl> 112, 108, 109, 109, 116, 111, 105, 114, 112, 112, 113, 115, 1…
$ Weight   <dbl> 13481.042, 24471.770, 57193.285, 55766.512, 65541.871, 25344.…
$ Diabetes <fct> Yes, Yes, No, NA, NA, NA, NA, NA, NA, NA, NA, NA, No, NA, NA,…
Code
glimpse(merged_data)
Rows: 9,813
Columns: 9
$ ID       <dbl> 73557, 73558, 73559, 73560, 73561, 73562, 73563, 73564, 73566…
$ BMI      <fct> NA, NA, NA, Normal weight, NA, NA, NA, NA, NA, NA, NA, Normal…
$ Age      <dbl> 69, 54, 72, 9, 73, 56, 0, 61, 56, 65, 26, 9, 76, 10, 10, 33, …
$ Gender   <fct> Male, Male, Male, Male, Female, Male, Male, Female, Female, M…
$ Race     <fct> Non-Hispanic Black, Non-Hispanic White, Non-Hispanic White, N…
$ PSU      <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 1…
$ Strata   <dbl> 112, 108, 109, 109, 116, 111, 105, 114, 112, 112, 113, 115, 1…
$ Weight   <dbl> 13481.042, 24471.770, 57193.285, 55766.512, 65541.871, 25344.…
$ Diabetes <fct> Yes, Yes, No, NA, NA, NA, NA, NA, NA, NA, NA, NA, No, NA, NA,…
Code
# 2. Run mice to create 5 imputed datasets
imputed_data <- mice(
  analytic_data,
  m = 5,              # number of imputed datasets
  method = 'pmm',     # predictive mean matching
  seed = 123
)

 iter imp variable
  1   1  BMI  Diabetes
  1   2  BMI  Diabetes
  1   3  BMI  Diabetes
  1   4  BMI  Diabetes
  1   5  BMI  Diabetes
  2   1  BMI  Diabetes
  2   2  BMI  Diabetes
  2   3  BMI  Diabetes
  2   4  BMI  Diabetes
  2   5  BMI  Diabetes
  3   1  BMI  Diabetes
  3   2  BMI  Diabetes
  3   3  BMI  Diabetes
  3   4  BMI  Diabetes
  3   5  BMI  Diabetes
  4   1  BMI  Diabetes
  4   2  BMI  Diabetes
  4   3  BMI  Diabetes
  4   4  BMI  Diabetes
  4   5  BMI  Diabetes
  5   1  BMI  Diabetes
  5   2  BMI  Diabetes
  5   3  BMI  Diabetes
  5   4  BMI  Diabetes
  5   5  BMI  Diabetes
Code
# 3. First imputed dataset
Imputed_data1 <- complete(imputed_data, 1)

# 4. Check missingness
str(Imputed_data1)
'data.frame':   9813 obs. of  9 variables:
 $ ID      : num  73557 73558 73559 73560 73561 ...
 $ BMI     : Factor w/ 4 levels "Underweight",..: 4 2 3 2 3 2 2 3 4 2 ...
 $ Age     : num  69 54 72 9 73 56 0 61 56 65 ...
 $ Gender  : Factor w/ 2 levels "Male","Female": 1 1 1 1 2 1 1 2 2 1 ...
 $ Race    : Factor w/ 5 levels "Mexican American",..: 4 3 3 3 3 1 3 3 3 3 ...
 $ PSU     : num  1 1 1 2 2 1 1 1 1 2 ...
 $ Strata  : num  112 108 109 109 116 111 105 114 112 112 ...
 $ Weight  : num  13481 24472 57193 55767 65542 ...
 $ Diabetes: Factor w/ 2 levels "Yes","No": 1 1 2 1 1 2 1 1 1 2 ...
Code
summary(Imputed_data1)
       ID                   BMI            Age           Gender    
 Min.   :73557   Underweight  : 218   Min.   : 0.00   Male  :4831  
 1st Qu.:76092   Normal weight:5107   1st Qu.:10.00   Female:4982  
 Median :78643   Overweight   :1831   Median :27.00                
 Mean   :78645   Obese        :2657   Mean   :31.63                
 3rd Qu.:81191                        3rd Qu.:52.00                
 Max.   :83731                        Max.   :80.00                
                                  Race           PSU            Strata     
 Mexican American                   :1685   Min.   :1.000   Min.   :104.0  
 Other Hispanic                     : 930   1st Qu.:1.000   1st Qu.:107.0  
 Non-Hispanic White                 :3538   Median :1.000   Median :111.0  
 Non-Hispanic Black                 :2198   Mean   :1.483   Mean   :110.9  
 Other Race - Including Multi-Racial:1462   3rd Qu.:2.000   3rd Qu.:115.0  
                                            Max.   :2.000   Max.   :118.0  
     Weight       Diabetes  
 Min.   :  3748   Yes:7252  
 1st Qu.: 13187   No :2561  
 Median : 20965             
 Mean   : 31713             
 3rd Qu.: 37922             
 Max.   :171395             
Code
colSums(is.na(Imputed_data1))
      ID      BMI      Age   Gender     Race      PSU   Strata   Weight 
       0        0        0        0        0        0        0        0 
Diabetes 
       0 
Code
plot_intro(Imputed_data1, title="Figure 8. Structure of variables and missing observations.")

Code
plot_bar(Imputed_data1, title = "Figure 9. Frequency plots of categorical variables.")

Code
plot_correlation(na.omit(Imputed_data1[, c("BMI", "Diabetes")]), maxcat=5L, title = "Figure")

Code
# 5. Cross-tabulation
tab <- table(Imputed_data1$BMI, Imputed_data1$Diabetes)
print(tab)
               
                 Yes   No
  Underweight    159   59
  Normal weight 3760 1347
  Overweight    1342  489
  Obese         1991  666
Code
# Chi-square test
chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 2.1289, df = 3, p-value = 0.5461
Code
# Cross-tabulation

# BMI vs Diabetes
tab_BMI <- table(Imputed_data1$BMI, Imputed_data1$Diabetes)
print(tab_BMI)
               
                 Yes   No
  Underweight    159   59
  Normal weight 3760 1347
  Overweight    1342  489
  Obese         1991  666
Code
prop.table(tab_BMI, 1) * 100  # row percentages
               
                     Yes       No
  Underweight   72.93578 27.06422
  Normal weight 73.62444 26.37556
  Overweight    73.29328 26.70672
  Obese         74.93414 25.06586
Code
# Gender vs Diabetes
tab_gender <- table(Imputed_data1$Gender, Imputed_data1$Diabetes)
prop.table(tab_gender, 1) * 100
        
              Yes       No
  Male   72.92486 27.07514
  Female 74.84946 25.15054
Code
# Race vs Diabetes
tab_race <- table(Imputed_data1$Race, Imputed_data1$Diabetes)
prop.table(tab_race, 1) * 100
                                     
                                           Yes       No
  Mexican American                    71.03858 28.96142
  Other Hispanic                      74.40860 25.59140
  Non-Hispanic White                  76.17298 23.82702
  Non-Hispanic Black                  72.47498 27.52502
  Other Race - Including Multi-Racial 73.52941 26.47059
Code
# Age vs Diabetes
tab_age <- table(Imputed_data1$Age, Imputed_data1$Diabetes)
head(prop.table(tab_age, 1) * 100)
   
         Yes       No
  0 74.93606 25.06394
  1 74.89879 25.10121
  2 72.51908 27.48092
  3 74.64115 25.35885
  4 73.48837 26.51163
  5 67.17172 32.82828
Code
# Breakdown of Diabetes within BMI
breakdown_BMI <- Imputed_data1 %>%
  group_by(BMI, Diabetes) %>%
  summarise(Count = n(), .groups = "drop") %>%
  group_by(BMI) %>%
  mutate(
    Percent = round(100 * Count / sum(Count), 1)
  )
breakdown_BMI
# A tibble: 8 × 4
# Groups:   BMI [4]
  BMI           Diabetes Count Percent
  <fct>         <fct>    <int>   <dbl>
1 Underweight   Yes        159    72.9
2 Underweight   No          59    27.1
3 Normal weight Yes       3760    73.6
4 Normal weight No        1347    26.4
5 Overweight    Yes       1342    73.3
6 Overweight    No         489    26.7
7 Obese         Yes       1991    74.9
8 Obese         No         666    25.1
Code
# 6. Frequency tables for categorical variables
categorical_vars <- c("BMI", "Gender", "Race", "Diabetes")

for (var in categorical_vars) {
  cat("\nFrequency table for", var, ":\n")
  print(table(Imputed_data1[[var]]))
  print(round(prop.table(table(Imputed_data1[[var]])), 3))
}

Frequency table for BMI :

  Underweight Normal weight    Overweight         Obese 
          218          5107          1831          2657 

  Underweight Normal weight    Overweight         Obese 
        0.022         0.520         0.187         0.271 

Frequency table for Gender :

  Male Female 
  4831   4982 

  Male Female 
 0.492  0.508 

Frequency table for Race :

                   Mexican American                      Other Hispanic 
                               1685                                 930 
                 Non-Hispanic White                  Non-Hispanic Black 
                               3538                                2198 
Other Race - Including Multi-Racial 
                               1462 

                   Mexican American                      Other Hispanic 
                              0.172                               0.095 
                 Non-Hispanic White                  Non-Hispanic Black 
                              0.361                               0.224 
Other Race - Including Multi-Racial 
                              0.149 

Frequency table for Diabetes :

 Yes   No 
7252 2561 

  Yes    No 
0.739 0.261 
Code
# 7. Summary statistics for continuous variables
continuous_vars <- c("Age")

for (var in continuous_vars) {
  cat("\nSummary statistics for", var, ":\n")
  print(summary(Imputed_data1[[var]]))
  print(paste("SD:", round(sd(Imputed_data1[[var]]), 2)))
}

Summary statistics for Age :
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   10.00   27.00   31.63   52.00   80.00 
[1] "SD: 24.4"
Code
# 8. Bar plots for categorical variables
for (var in categorical_vars) {
  ggplot(Imputed_data1, aes(x = .data[[var]])) +   # aes_string() is deprecated
    geom_bar(fill = "steelblue") +
    labs(title = paste("Bar plot of", var), y = "Count") +
    theme_minimal() -> p
  print(p)
}

Code
# 9. Histograms for continuous variables
for (var in continuous_vars) {
  ggplot(Imputed_data1, aes(x = .data[[var]])) +   # aes_string() is deprecated
    geom_histogram(fill = "skyblue", color = "black", bins = 30) +
    labs(title = paste("Histogram of", var), y = "Frequency") +
    theme_minimal() -> p
  print(p)
}

Code
# 10. Scatter plot example (BMI vs Age)
ggplot(Imputed_data1, aes(x = Age, y = BMI, color = BMI)) +
  geom_point(alpha = 0.6) +
  labs(title = "Scatter plot of BMI vs Age", y = "BMI", x = "Age") +
  theme_minimal()

Code
# 11. Relative breakdown of Diabetes by BMI
ggplot(breakdown_BMI, aes(x = BMI, y = Percent, fill = Diabetes)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(
    title = "Relative Breakdown of Diabetes by BMI Category",
    x = "BMI Category", y = "Proportion"
  ) +
  theme_minimal()

Code
# 12. Crosstab with percentages
Imputed_data1 %>% 
  tabyl(Diabetes, BMI) %>% 
  adorn_percentages("col")
 Diabetes Underweight Normal weight Overweight     Obese
      Yes   0.7293578     0.7362444  0.7329328 0.7493414
       No   0.2706422     0.2637556  0.2670672 0.2506586
Code
# 13. Margin plot for BMI vs Diabetes
marginplot(
  Imputed_data1[, c("BMI", "Diabetes")],
  col = mdc(1:2, trans = FALSE),
  cex = 1.2,
  cex.lab = 1.2,
  cex.numbers = 1.3,
  pch = 19
)

Results from MICE:

  • Multiple imputation filled in the missing values, yielding 9,813 complete observations with no NAs. A comparative bar plot of missingness in the raw versus imputed data is presented below.

  • A heatmap of the imputed dataset shows the correlation between BMI categories and Diabetes status; the BMI dummy variables are strongly negatively correlated with one another.

  • There was no strong association between BMI category and diabetes in the dataset: the chi-square test of the categorical variables returned p = 0.5461 (> 0.05), i.e., no evidence of association. The margin plot confirms that the distribution of imputed points is consistent with the observed data (no unusual outliers).
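The missingness comparison described above can be sketched directly. This is a minimal sketch assuming the `merged_data` and `Imputed_data1` objects created in the earlier chunks are in scope:

```r
# NA counts per variable: raw merged data vs. the first completed dataset from mice
na_raw     <- colSums(is.na(merged_data))
na_imputed <- colSums(is.na(Imputed_data1[, names(merged_data)]))

na_compare <- data.frame(Variable = names(na_raw),
                         Raw      = na_raw,
                         Imputed  = na_imputed)
print(na_compare)

# After imputation every count should be zero
stopifnot(all(na_compare$Imputed == 0))
```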

Modeling

The Logistic regression model is:

\[ \text{logit}(P(Y_i=1)) = \beta_0 + \beta_1 \cdot Age_i + \beta_2 \cdot BMI_i + \beta_3 \cdot Race_i + \beta_4 \cdot Gender_i \]

Linear Regression equation:

\[ y_i = \beta_0 + \beta_1 X_1 +\varepsilon_i \]
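The logit link above maps probabilities to log-odds; inverting it recovers a predicted probability from the linear predictor. A minimal base-R sketch (the coefficient values below are illustrative, not taken from the fitted model):

```r
# Inverse logit: p = exp(eta) / (1 + exp(eta)); base R provides this as plogis()
inv_logit <- function(eta) 1 / (1 + exp(-eta))

inv_logit(0)   # a log-odds of 0 corresponds to probability 0.5
plogis(0)      # same result via the built-in

# Illustrative linear predictor of the form beta0 + beta1 * Age
beta0 <- -0.76
beta1 <- -0.0047
inv_logit(beta0 + beta1 * 50)   # predicted probability at Age = 50
```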

Comparison of multiple imputation and Bayesian data augmentation

Multiple imputation
  • frequentist approach; requires no priors; moderately flexible
  • handles missing values first by imputation, then performs the regression analysis and pools the results
  • uncertainty from imputation is propagated after the analysis, at the pooling step

Bayesian data augmentation
  • performs missing-data imputation and regression-model fitting simultaneously, on the data with missingness
  • Markov chain Monte Carlo (MCMC) draws samples from the joint posterior of the regression parameters and the missing values, providing posterior means, credible intervals, and probabilities
  • priors shrink extreme estimates back toward plausible values
  • uncertainty in the missing values is fully propagated through the model, which naturally handles small or sparse datasets and separation problems
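The standard multiple-imputation workflow (impute, analyze each completed dataset, pool by Rubin's rules) can be sketched with `mice` on its built-in `nhanes` toy dataset; the variable names below belong to that toy data, not to this study:

```r
library(mice)

# Impute the small built-in nhanes dataset (25 rows, 4 variables)
imp <- mice(nhanes, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Fit the same regression on each of the 5 completed datasets
fits <- with(imp, lm(chl ~ age + bmi))

# Pool estimates and standard errors across imputations (Rubin's rules)
pooled <- pool(fits)
summary(pooled)
```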

Diagnostics performed before regression analysis

  1. Modeling assumption check
  2. Correlation matrix, Cook’s distance, influential points
  3. Model plot
Code
# Frequentist logistic regression on the imputed data
# (assumption checks follow: correlation/VIF, influential points, goodness of fit)

m_imp <- glm(Diabetes ~ Age + Gender + Race + BMI,
             data = Imputed_data1,
             family = binomial)
summary(m_imp)

Call:
glm(formula = Diabetes ~ Age + Gender + Race + BMI, family = binomial, 
    data = Imputed_data1)

Coefficients:
                                         Estimate Std. Error z value Pr(>|z|)
(Intercept)                             -0.761926   0.162425  -4.691 2.72e-06
Age                                     -0.004728   0.001011  -4.678 2.89e-06
GenderFemale                            -0.092730   0.046133  -2.010  0.04442
RaceOther Hispanic                      -0.154250   0.092549  -1.667  0.09558
RaceNon-Hispanic White                  -0.212663   0.067746  -3.139  0.00169
RaceNon-Hispanic Black                  -0.055537   0.072128  -0.770  0.44131
RaceOther Race - Including Multi-Racial -0.109388   0.080345  -1.361  0.17336
BMINormal weight                         0.024339   0.156361   0.156  0.87630
BMIOverweight                            0.071544   0.162658   0.440  0.66005
BMIObese                                 0.020287   0.161137   0.126  0.89981
                                           
(Intercept)                             ***
Age                                     ***
GenderFemale                            *  
RaceOther Hispanic                      .  
RaceNon-Hispanic White                  ** 
RaceNon-Hispanic Black                     
RaceOther Race - Including Multi-Racial    
BMINormal weight                           
BMIOverweight                              
BMIObese                                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 11267  on 9812  degrees of freedom
Residual deviance: 11219  on 9803  degrees of freedom
AIC: 11239

Number of Fisher Scoring iterations: 4
Code
coef(m_imp)
                            (Intercept)                                     Age 
                           -0.761925862                            -0.004727998 
                           GenderFemale                      RaceOther Hispanic 
                           -0.092730385                            -0.154250348 
                 RaceNon-Hispanic White                  RaceNon-Hispanic Black 
                           -0.212662854                            -0.055536906 
RaceOther Race - Including Multi-Racial                        BMINormal weight 
                           -0.109388375                             0.024339471 
                          BMIOverweight                                BMIObese 
                            0.071543638                             0.020286793 
Code
confint(m_imp)
                                               2.5 %       97.5 %
(Intercept)                             -1.086867141 -0.449109500
Age                                     -0.006713516 -0.002751496
GenderFemale                            -0.183172208 -0.002320639
RaceOther Hispanic                      -0.336602276  0.026303770
RaceNon-Hispanic White                  -0.345137536 -0.079535197
RaceNon-Hispanic Black                  -0.196779898  0.086003952
RaceOther Race - Including Multi-Racial -0.267119094  0.047893182
BMINormal weight                        -0.276189156  0.337871931
BMIOverweight                           -0.241855094  0.396816724
BMIObese                                -0.289985572  0.342735874
Code
# Log-odds (link)
Imputed_data1$logit <- predict(m_imp, type = "link") ## log (Odds) 

# Probability (inverse logit); plogis() is the base-R equivalent of exp(x) / (1 + exp(x))
Imputed_data1$prob <- plogis(Imputed_data1$logit)

# Plot predicted probability vs Age
ggplot(Imputed_data1, aes(x = Age, y = prob)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess") +
  labs(x = "Age", y = "Predicted Probability of Diabetes")

Findings from regression of the MI dataset

  1. Logistic regression on imputed data (frequentist approach)
  • The relationship between Age and the log-odds of diabetes is roughly, though not perfectly, linear, which is acceptable for the logistic regression assumptions.

  • The generalized variance inflation factors (GVIF-adjusted) indicate no collinearity between predictors (GVIF ≈ 1.0–1.04), so the model can be fit without removing or combining variables.

  • From Cook's distance and leverage, most data points are not influential. With n = 9813, the cutoff 4/n ≈ 0.0004. A cluster at high leverage reflects unusual predictor values but not high influence; a few points exceed the Cook's distance cutoff and are worth checking individually, but they pose no major threat to model stability. No outliers were detected.

  • The Hosmer–Lemeshow (H–L) test at alpha = 0.05 gives p < 0.001, indicating that the logistic regression model does not fit the data well.

  • The graph shows residuals vs. fitted values (imputed-data model).

  2. Results from Bayesian data augmentation and logistic regression
  • We incorporate prior knowledge that BMI increases the odds of diabetes.
  • We use priors for the Bayesian logistic regression and compare models fitted with different priors:
    • Prior (intercept): taken from this study's data
    • Prior (coefficients): BMI, Age, Gender, Race
      • Weakly informative prior: βBMI ∼ N(0, 2.5²)
      • A common approach is a normal distribution, β ∼ N(μ, σ²), for each regression coefficient.
      • Informative priors from previous studies: βBMI ∼ N(μ, σ²), βAge ∼ N(μ, σ²), βGender ∼ N(μ, σ²), βRace ∼ N(μ, σ²)

For males, the informative prior we use, taken from Ali et al. (2024), is

Normal(μ = 1.705, σ² = 0.448²).
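In brms, informative priors are attached to named coefficients. A sketch of the syntax only: the coefficient name `GenderFemale` is used as a placeholder (the Ali et al. value describes males, so mapping it onto brms's Female-vs-Male contrast would need care with signs and parameterization), and the model formula matches the one fitted below.

```r
library(brms)

# Weakly informative defaults plus an informative prior on one named coefficient.
# Coefficient names must match those brms generates; run get_prior() first to see them.
# The numeric values and the target coefficient here are illustrative only.
informative_priors <- c(
  set_prior("normal(0, 5)",   class = "Intercept"),
  set_prior("normal(0, 2.5)", class = "b"),                             # default for all slopes
  set_prior("normal(1.705, 0.448)", class = "b", coef = "GenderFemale") # placeholder target
)

# Inspect which priors the model structure would accept before fitting
get_prior(Diabetes ~ Age + BMI + Gender + Race,
          data = Imputed_data1, family = bernoulli())
```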

Code
# Fitted values and residuals
fitted_imputed1 <- fitted(m_imp)
residual_imputed1 <- residuals(m_imp)

# Residuals vs Fitted plot
plot(fitted_imputed1, residual_imputed1,
     xlab = "Fitted probabilities",
     ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red", lty = 2)

Code
# Collinearity check
library(car)
vif(m_imp)  # VIF > 5 indicates multicollinearity
           GVIF Df GVIF^(1/(2*Df))
Age    1.108163  1        1.052693
Gender 1.002032  1        1.001015
Race   1.040913  4        1.005025
BMI    1.078206  3        1.012629
Code
# Influential points
library(broom)
influence_m_imp <- broom::augment(m_imp)

# Plot Cook's distance
ggplot(influence_m_imp, aes(x = seq_along(.cooksd), y = .cooksd)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  geom_hline(yintercept = 4 / nrow(influence_m_imp), color = "red", linetype = "dashed") +
  labs(x = "Observation", y = "Cook's Distance", title = "Influential Points (Cook's Distance)") +
  theme_minimal()

Code
influence_m_imp <- influence_m_imp %>%
  mutate(outlier = ifelse(abs(.std.resid) > 2, TRUE, FALSE))

# Cook's distance plot
ggplot(influence_m_imp, aes(x = .hat, y = .cooksd)) +
  geom_point() +
  labs(x = "Leverage", y = "Cook's Distance")

Code
###   Transform Response, check for Goodness-of-Fit   ###

# Numeric response
Imputed_data1$Diabetes_num <- ifelse(Imputed_data1$Diabetes == "Yes", 1, 0)

# Hosmer-Lemeshow test

library(ResourceSelection)
hoslem.test(Imputed_data1$Diabetes_num, fitted(m_imp))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  Imputed_data1$Diabetes_num, fitted(m_imp)
X-squared = 12190, df = 8, p-value < 2.2e-16
Code
# ANOVA for model
anova(m_imp)
Analysis of Deviance Table

Model: binomial, link: logit

Response: Diabetes

Terms added sequentially (first to last)

       Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                    9812      11267              
Age     1  31.1696      9811      11236 2.364e-08 ***
Gender  1   3.9271      9810      11232   0.04751 *  
Race    4  12.2392      9806      11220   0.01566 *  
BMI     3   0.7227      9803      11219   0.86784    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Adjusted R²
library(glmtoolbox)
adjR2(m_imp)
[1] 0.00335
Code
# Residual vs fitted for imputed data
plot(m_imp$fitted.values, resid(m_imp),
     xlab = "Fitted values",
     ylab = "Residuals",
     main = "Residuals vs Fitted",
     pch = 19, col = "blue")
abline(h = 0, col = "red", lwd = 2)

The Hosmer–Lemeshow test was conducted to assess the goodness of fit of the multivariate logistic regression model, alongside adjR2(m_imp), a chi-square test, and a visualization of the model (fitted vs. residual values).

To overcome the quasi-separation issue in the data, Firth (penalized) logistic regression was conducted; its summary, based on only 14 complete observations, is presented below.

Next, to deal with the small sample size, imputation (MICE) was conducted along with the regression predicting Diabetes ~ BMI, Age, Gender, Race.

A summary and visualization of one dataset extracted from the 5 MICE datasets is presented here, along with the predicted values from the imputed model (m_imp) and plots of cross-tabulations between the predictors and the response variable (Diabetes).

Code
# Frequentist logistic regression on raw data: Firth logistic regression (penalized)
library(logistf)

m_firth <- logistf(Diabetes ~ BMI + Age + Gender + Race,
                   data = merged_data)
summary(m_firth)
logistf(formula = Diabetes ~ BMI + Age + Gender + Race, data = merged_data)

Model fitted by Penalized ML
Coefficients:
                                              coef  se(coef)  lower 0.95
(Intercept)                              2.2909182 2.0520292  -1.7059362
BMIOverweight                            2.5151817 1.9838969  -1.2146946
BMIObese                                -0.4519637 1.8778881  -5.4843160
Age                                     -0.2545704 0.1911568  -1.4579663
GenderFemale                            -2.1675543 2.0190594 -17.0316807
RaceOther Hispanic                       0.9518795 1.9225099 -12.0194116
RaceNon-Hispanic White                  -2.5671499 2.2189914 -21.6563170
RaceNon-Hispanic Black                   3.3280688 2.2005227  -0.4678607
RaceOther Race - Including Multi-Racial  1.3072954 2.0609641  -2.6483209
                                        upper 0.95      Chisq          p method
(Intercept)                             16.1697706 1.25690744 0.26223728      2
BMIOverweight                           21.9385716 1.68994830 0.19360778      2
BMIObese                                 3.4687761 0.05538832 0.81393924      2
Age                                      0.1148471 1.81192970 0.17827692      2
GenderFemale                             5.9967059 0.69556457 0.40427809      2
RaceOther Hispanic                      10.8617660 0.20246895 0.65273532      2
RaceNon-Hispanic White                   1.6700624 1.36287409 0.24304000      2
RaceNon-Hispanic Black                  25.3575974 2.86971841 0.09026066      2
RaceOther Race - Including Multi-Racial 21.7820826 0.39215076 0.53117103      2

Method: 1-Wald, 2-Profile penalized log-likelihood, 3-None

Likelihood ratio test=6.066658 on 8 df, p=0.6397653, n=14
Wald test = 5.257182 on 8 df, p = 0.7297677
Code
# Imputed data plots: predicted probabilities (pred_prob_imputed)
 
Imputed_data1 <- Imputed_data1 %>%
  mutate(pred_prob_imputed = predict(m_imp, type = "response")) # predicted probabilities

Imputed_plot <- Imputed_data1 %>% select(BMI, pred_prob_imputed) %>% mutate(Source = "Imputed")


# Rename probability column to common name
Imputed_plot <- Imputed_plot %>% rename(Pred_Prob = pred_prob_imputed)


ggplot(Imputed_data1, aes(x = BMI, y = pred_prob_imputed, fill = BMI)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "BMI Category", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by BMI Category") +
  theme_minimal()

Code
merged_data_clean <- merged_data %>%
  filter(!is.na(BMI), !is.na(Diabetes))

ggplot(merged_data_clean, aes(x = BMI, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "BMI Category", y = "Proportion with Diabetes",
       fill = "Diabetes",
       title = "Observed Diabetes Proportions by BMI Category") +
  theme_minimal()

Code
ggplot(Imputed_data1, aes(x = Race, y = pred_prob_imputed, fill = Race)) +
  geom_boxplot(alpha = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3, size = 1) +
  labs(x = "Race ", y = "Predicted Probability of Diabetes",
       title = "Predicted Diabetes Probability by Race ") +
  theme_minimal()

Code
merged_data_clean_Race <- merged_data %>%
  filter(!is.na(Race), !is.na(Diabetes))

ggplot(merged_data_clean, aes(x = Race, fill = factor(Diabetes))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Race ", y = "Proportion with Diabetes",
       fill = "Diabetes",
       title = "Observed Diabetes Proportions by Race ") +
  theme_minimal()

The proportion of diabetes status by age group (<40 and >=40) is tabulated below.

Code
ggplot(merged_data, aes(x = Age, y = Diabetes)) + 
  geom_jitter(size = 0.2)

Code
ggplot(Imputed_data1, aes(x = Age, y = Diabetes)) + 
  geom_jitter(size = 0.2)

Code
# Create age groups
# Create contingency table with Diabetes

Imputed_data1$Age_group <- ifelse(Imputed_data1$Age < 40, "<40", ">=40")

tab_age <- table(Imputed_data1$Age_group, Imputed_data1$Diabetes)
prop_age <- prop.table(tab_age, 1) * 100

tab_age
      
        Yes   No
  <40  4414 1691
  >=40 2838  870
Code
prop_age
      
            Yes       No
  <40  72.30139 27.69861
  >=40 76.53722 23.46278
Code
# Convert table to data frame
df_age <- as.data.frame(tab_age)
names(df_age) <- c("Age_group", "Diabetes", "Count")  # rename columns
Code
## Reference: Gelman et al. (2008). Weakly informative priors: Normal(0, 2.5) for
## coefficients (b) and Normal(0, 5) for the intercept as defaults for logistic regression.
# Bayesian logistic regression
library(brms)

priors <- c(
  set_prior("normal(0, 2.5)", class = "b"),        # for coefficients
  set_prior("normal(0, 5)", class = "Intercept")   # for intercept
)


formula_bayes <- bf(Diabetes ~ Age + BMI + Gender + Race)

Diabetes_prior <- brm(
  formula = formula_bayes,
  data = Imputed_data1,
  family = bernoulli(link = "logit"),   # logistic regression
  prior = priors,
  chains = 4,
  iter = 2000,
  seed = 123,
  control = list(adapt_delta = 0.95)
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.00043 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 4.3 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 14.641 seconds (Warm-up)
Chain 1:                8.9 seconds (Sampling)
Chain 1:                23.541 seconds (Total)
Chain 1: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 2).
Chain 2: 
Chain 2: Gradient evaluation took 0.000464 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 4.64 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2: 
Chain 2: 
Chain 2: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 2: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 2: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 2: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 2: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 2: 
Chain 2:  Elapsed Time: 18.732 seconds (Warm-up)
Chain 2:                6.138 seconds (Sampling)
Chain 2:                24.87 seconds (Total)
Chain 2: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 3).
Chain 3: 
Chain 3: Gradient evaluation took 0.000369 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 3.69 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3: 
Chain 3: 
Chain 3: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 3: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 3: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 3: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 3: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 3: 
Chain 3:  Elapsed Time: 10.662 seconds (Warm-up)
Chain 3:                7.065 seconds (Sampling)
Chain 3:                17.727 seconds (Total)
Chain 3: 

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 4).
Chain 4: 
Chain 4: Gradient evaluation took 0.000499 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 4.99 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4: 
Chain 4: 
Chain 4: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 4: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 4: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 4: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 4: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 4: 
Chain 4:  Elapsed Time: 16.088 seconds (Warm-up)
Chain 4:                8.432 seconds (Sampling)
Chain 4:                24.52 seconds (Total)
Chain 4: 
Code
## Bayes model summary
summary(Diabetes_prior)
 Family: bernoulli 
  Links: mu = logit 
Formula: Diabetes ~ Age + BMI + Gender + Race 
   Data: Imputed_data1 (Number of observations: 9813) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Regression Coefficients:
                                    Estimate Est.Error l-95% CI u-95% CI Rhat
Intercept                              -0.77      0.16    -1.09    -0.44 1.00
Age                                    -0.00      0.00    -0.01    -0.00 1.00
BMINormalweight                         0.03      0.16    -0.27     0.34 1.00
BMIOverweight                           0.08      0.16    -0.24     0.40 1.00
BMIObese                                0.03      0.16    -0.29     0.35 1.00
GenderFemale                           -0.09      0.05    -0.18    -0.00 1.00
RaceOtherHispanic                      -0.15      0.09    -0.34     0.03 1.00
RaceNonMHispanicWhite                  -0.21      0.07    -0.34    -0.08 1.00
RaceNonMHispanicBlack                  -0.06      0.07    -0.20     0.09 1.00
RaceOtherRaceMIncludingMultiMRacial    -0.11      0.08    -0.26     0.05 1.00
                                    Bulk_ESS Tail_ESS
Intercept                               1834     1902
Age                                     4495     2737
BMINormalweight                         1820     1960
BMIOverweight                           1899     2263
BMIObese                                1884     1952
GenderFemale                            3564     2756
RaceOtherHispanic                       2685     2528
RaceNonMHispanicWhite                   2156     2917
RaceNonMHispanicBlack                   2444     2590
RaceOtherRaceMIncludingMultiMRacial     2311     2896

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
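Because the coefficients above are on the logit scale, exponentiating them gives odds ratios; a minimal sketch using the RaceNonMHispanicWhite row of the summary (estimate -0.21, 95% CI -0.34 to -0.08):

```r
# Convert a logit-scale estimate and its 95% CI to an odds-ratio scale
logit_ci <- c(estimate = -0.21, lower = -0.34, upper = -0.08)
odds_ratio <- exp(logit_ci)
round(odds_ratio, 2)   # approx. 0.81 (0.71, 0.92): lower odds vs. the reference race
```

An odds ratio below 1 with a credible interval excluding 1 indicates lower odds of diabetes relative to the reference category.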
Code
plot(Diabetes_prior) 

Code
## Draws 

# Generate fitted draws directly with brms
fitted_draws <- fitted(
  Diabetes_prior,
  newdata = Imputed_data1,
  summary = FALSE,   # return all requested posterior draws instead of a summary
  ndraws = 100       # limit to 100 draws (`nsamples` is deprecated in recent brms)
)

# Convert to long format manually


fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

### BMI: plot the fitted draws
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Age
fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Race

fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Gender
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(y = "Predicted probability of Diabetes")

Code
### Age cut by 10 years and Diabetes plot and histogram (imputed data)

Imputed_data1 %>% 
  mutate(age_bracket = cut(Age, breaks = seq(10, 100, by = 10))) %>% 
  group_by(age_bracket) %>% 
  summarise(Diabetes = mean(Diabetes == "Yes")) %>% 
  ggplot(aes(x = age_bracket, y = Diabetes)) + 
    geom_point() + 
    theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

Code
### predict values of 100 draws, Simulated predictions (binary diabetes outcome)

pred <- posterior_predict(Diabetes_prior, newdata = Imputed_data1, ndraws = 100)

# data frame for summarizing
pred_df <- as.data.frame(t(pred)) 

# proportion of diabetes = 1 per draw
prop_diabetes <- colMeans(pred_df == 1)


prop_df <- tibble(
  draw = 1:length(prop_diabetes),
  proportion_Diabetes = prop_diabetes    ## proportion of predicted Diabetes per posterior draw
)

library(ggplot2)
ggplot(prop_df, aes(x = proportion_Diabetes)) +
  geom_histogram(color = "white")

Bayesian logistic regression model with a weakly informative prior: compilation, iterations, and posterior draws used NUTS sampling.
  • Fitted draws from the model posterior (n = 100) were analyzed.
  • Estimates and Rhat values were checked for convergence.
  • Plots are presented below.
  • The histogram of predicted values (100 draws) shows the proportion of diabetes per draw.
  • A scatterplot shows the proportion of diabetes grouped by 10-year age brackets.

We then refit the model with prior sampling enabled and analyzed 100 posterior draws, plotting posterior predicted probabilities of diabetes against Age, BMI, Race, and Gender.

Code
library(brms)
library(GGally)

# Simulate the model


Diabetes_model_1 <- update(
  Diabetes_prior,
  sample_prior = "yes"   # also draw from the priors alongside the posterior
)

SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
Chain 1: 
Chain 1: Gradient evaluation took 0.000382 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 3.82 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1: 
Chain 1: 
Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1: 
Chain 1:  Elapsed Time: 17.283 seconds (Warm-up)
Chain 1:                7.108 seconds (Sampling)
Chain 1:                24.391 seconds (Total)
Chain 1: 

Code
# BMI
# Posterior fitted values (probabilities of Diabetes)
fitted_draws <- fitted(
  Diabetes_model_1,         # use the updated model here
  newdata = Imputed_data1,
  summary = FALSE,          # full posterior draws
  ndraws = 100              # sample 100 posterior draws (`nsamples` is deprecated)
)

# Reshape into long format
fitted_long <- data.frame(
  BMI = rep(Imputed_data1$BMI, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines (BMI)
library(ggplot2)
ggplot(fitted_long, aes(x = BMI, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by BMI"
  )

Code
# Age
# Reshape into long format

fitted_long <- data.frame(
  Age = rep(Imputed_data1$Age, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Age, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by Age"
  )

Code
# Race
# Reshape into long format

fitted_long <- data.frame(
  Race = rep(Imputed_data1$Race, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Race, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by Race"
  )

Code
# Gender
# Reshape into long format
fitted_long <- data.frame(
  Gender = rep(Imputed_data1$Gender, times = 100),
  draw = rep(1:100, each = nrow(Imputed_data1)),
  .value = as.vector(t(fitted_draws))
)

# Plot posterior fitted lines
library(ggplot2)
ggplot(fitted_long, aes(x = Gender, y = .value, group = draw)) +
  geom_line(linewidth = 0.1, alpha = 0.4) +
  labs(
    y = "Predicted probability of Diabetes",
    title = "Posterior fitted draws by Gender"
  )

Code
# MCMC density and trace plots

plot(Diabetes_model_1, combo = c("dens", "trace"))

  • We performed Bayesian logistic regression to model the probability of diabetes as a function of age, BMI, gender, and race.
  • We used weakly informative normal priors (Gosho et al. (2025)) on all regression coefficients. Models were fit using four MCMC chains of 2000 iterations each, including 1000 warm-up iterations. Convergence was assessed using Rhat values, effective sample sizes, and trace plots. Posterior predictive checks were used to evaluate model fit, and coefficients were exponentiated to report odds ratios with 95% credible intervals.
  • The cluster of points representing a higher probability of diabetes is denser among individuals in the middle to older age ranges (roughly 40 to 80 years) than among younger individuals, although diabetes is present at younger ages as well.
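For intuition about the Rhat diagnostic used above, the following is a minimal base-R sketch of the split-Rhat computation on simulated, well-mixed chains; the data are hypothetical, and brms/rstan compute this internally with additional refinements.

```r
# Split-Rhat sketch: split each chain in half, then compare between-chain
# and within-chain variance (at convergence the ratio approaches 1).
split_rhat <- function(draws) {          # draws: iterations x chains matrix
  half <- nrow(draws) %/% 2
  split <- cbind(draws[1:half, ], draws[(half + 1):(2 * half), ])
  n <- nrow(split)
  chain_means <- colMeans(split)
  B <- n * var(chain_means)              # between-chain variance
  W <- mean(apply(split, 2, var))        # within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

set.seed(1)
draws <- matrix(rnorm(4000), nrow = 1000, ncol = 4)  # 4 simulated, well-mixed chains
round(split_rhat(draws), 2)              # close to 1, as in the summary table
```

Chains that had not mixed (e.g., stuck at different means) would inflate B relative to W and push Rhat above 1.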

Comparing Models

  • Linear regression model on raw data

  • Multivariate logistic regression on imputed dataset (MI + MLR)

  • Bayesian logistic regression on the imputed dataset. The spread of the fitted lines indicates the variability (uncertainty) in the predicted probabilities within each BMI group under the posterior model. Although BMI is a well-established risk factor for diabetes, the BMI category coefficients in this model are small, with 95% credible intervals spanning zero, so the fitted probabilities differ little across the normal-weight, overweight, and obese categories here.
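To see how the fitted lines arise from the coefficients, one can apply the inverse logit by hand. This sketch uses the posterior-mean intercept from the summary and an assumed age slope of -0.004 (the printed summary rounds the Age coefficient to -0.00), for a hypothetical individual at the reference levels of BMI, gender, and race:

```r
# Hand-computed fitted probability from posterior-mean coefficients.
# Assumptions: reference levels for BMI, Gender, and Race; age slope of
# -0.004 (illustrative; the summary rounds the Age estimate to -0.00).
intercept <- -0.77
beta_age  <- -0.004
age       <- 50
p <- plogis(intercept + beta_age * age)  # inverse logit: 1 / (1 + exp(-x))
round(p, 3)                              # approx. 0.275
```

Each thin line in the plots above is this same calculation repeated for one posterior draw of the coefficients, which is why the spread of the lines visualizes posterior uncertainty.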

Conclusion

  • Key findings: all Rhat values were 1.00, indicating convergence; posterior estimates showed small negative associations of diabetes with age, female gender, and non-Hispanic White race, while the BMI category coefficients were near zero with 95% credible intervals spanning zero.

  • Discuss the implications of your results.

References

Ali, Sakhawat, Rizwana Hussain, Rohaib A Malik, Raheema Amin, and Muhammad N Tariq. 2024. “Association of Obesity With Type 2 Diabetes Mellitus: A Hospital-Based Unmatched Case-Control Study.” Cureus 16 (1): 1–7. https://doi.org/10.7759/cureus.52728.
Buuren, Stef van, and Karin Groothuis-Oudshoorn. 2011. “mice: Multivariate Imputation by Chained Equations in R.” Journal of Statistical Software 45 (3): 1–67. https://doi.org/10.18637/jss.v045.i03.
National Center for Health Statistics. 2013. “National Health and Nutrition Examination Survey: Analytic Guidelines, 1999–2010.” Vital and Health Statistics Series 2, Number 161. https://wwwn.cdc.gov/nchs/data/series/sr02_161.pdf.
Chatzimichail, Theodora, and Aristides T. Hatjimihail. 2023. “A Bayesian Inference Based Computational Tool for Parametric and Nonparametric Medical Diagnosis.” Diagnostics 13 (19). https://doi.org/10.3390/diagnostics13193135.
D’Angelo, Gina, and Di Ran. 2025. “Tutorial on Firth’s Logistic Regression Models for Biomarkers in Preclinical Space.” Pharmaceutical Statistics 24 (1): 1–8. https://doi.org/10.1002/pst.2422.
Gosho, Masahiko, Ryota Ishii, Kengo Nagashima, Hisashi Noma, and Kazushi Maruo. 2025. “Determining the Prior Mean in Bayesian Logistic Regression with Sparse Data: A Nonarbitrary Approach.” Journal of the Royal Statistical Society, Series C: Applied Statistics 74 (1): 126–41. https://doi.org/10.1093/jrsssc/qlae048.
Klauenberg, Katy, Gerd Wübbeler, Bodo Mickan, Peter Harris, and Clemens Elster. 2015. “A Tutorial on Bayesian Normal Linear Regression.” Metrologia 52 (6): 878–92. https://doi.org/10.1088/0026-1394/52/6/878.
Schoot, Rens van de, Sarah Depaoli, Ruth King, Bianca Kramer, Kaspar Märtens, Mahlet G. Tadesse, Marina Vannucci, et al. 2021. “Bayesian Statistics and Modelling.” Nature Reviews Methods Primers 1 (1): 1. https://doi.org/10.1038/s43586-020-00001-2.
Van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Boca Raton, FL: CRC Press.